A Probabilistic Address Parser Using Conditional Random Fields and Stochastic Regular Grammar

Abstract

Automatic semantic annotation of data from databases or the web is an important preprocessing step for data cleansing and record linkage. It can be used to resolve imperfect field alignment in a database or to identify comparable fields for matching records from multiple sources. The annotation process is not trivial because data values may be noisy, containing abbreviations, variations, or misspellings. In particular, overlapping features usually exist in a lexicon-based approach. In this work, we present a probabilistic address parser based on linear-chain conditional random fields (CRFs), which allow more expressive token-level features than hidden Markov models (HMMs). In addition, we propose two general enhancement techniques to improve performance. One takes the original semi-structure of the data into account. The other post-processes the parser's output sequences by combining their conditional probability with a score function based on a learned stochastic regular grammar (SRG) that captures segment-level dependencies. Experiments compared the CRF parser with an HMM parser and a semi-Markov CRF parser on two real-world datasets. The CRF parser outperformed the HMM parser and the semi-Markov CRF parser on both datasets in terms of classification accuracy. Leveraging the structure of the data and combining the linear-chain CRF with the SRG further improved the parser, which achieved an accuracy of 97% on a postal dataset and 96% on a company dataset.
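The post-processing idea in the abstract, rescoring candidate label sequences with a segment-level grammar, can be illustrated with a minimal sketch. The abstract does not give the actual score function or grammar form, so everything below is an assumption: the grammar is approximated by smoothed transition probabilities between adjacent segment labels, the combination is a weighted sum of log scores, and the label names, the `weight` parameter, and the candidate probabilities are hypothetical.

```python
import math
from collections import defaultdict
from itertools import groupby

START, END = "<s>", "</s>"  # hypothetical boundary symbols


def segments(labels):
    """Collapse a token-level label sequence into its segment-level sequence."""
    return [label for label, _ in groupby(labels)]


def learn_grammar(training_label_sequences, smoothing=1.0):
    """Estimate add-k smoothed transition probabilities between segment labels.

    Assumption: a first-order model over segments stands in for the
    learned stochastic regular grammar mentioned in the abstract.
    """
    counts = defaultdict(lambda: defaultdict(float))
    symbols = {END}
    for labels in training_label_sequences:
        path = [START] + segments(labels) + [END]
        symbols.update(path[1:])
        for prev, nxt in zip(path, path[1:]):
            counts[prev][nxt] += 1.0

    def prob(prev, nxt):
        total = sum(counts[prev].values()) + smoothing * len(symbols)
        return (counts[prev][nxt] + smoothing) / total

    return prob


def grammar_log_score(labels, prob):
    """Log probability of the segment sequence under the transition model."""
    path = [START] + segments(labels) + [END]
    return sum(math.log(prob(a, b)) for a, b in zip(path, path[1:]))


def rerank(candidates, prob, weight=1.0):
    """Pick the candidate maximizing log p(labels | tokens) + weight * grammar score.

    `candidates` is a list of (labels, conditional_probability) pairs, as an
    n-best decoder of a sequence labeler might return them.
    """
    def combined(item):
        labels, p = item
        return math.log(p) + weight * grammar_log_score(labels, prob)

    return max(candidates, key=combined)[0]


# Hypothetical training label sequences and n-best output.
training = [
    ["NUMBER", "STREET", "STREET", "CITY", "POSTCODE"],
    ["NUMBER", "STREET", "CITY", "CITY", "POSTCODE"],
    ["NUMBER", "STREET", "STREET", "CITY", "POSTCODE"],
]
prob = learn_grammar(training)

candidates = [
    (["NUMBER", "STREET", "CITY", "STREET", "POSTCODE"], 0.40),  # top-1, odd segment order
    (["NUMBER", "STREET", "STREET", "CITY", "POSTCODE"], 0.35),
]
best = rerank(candidates, prob)
print(best)
```

In this toy setup the second candidate is selected even though its conditional probability is lower, because the first contains segment transitions (CITY to STREET, STREET to POSTCODE) never seen in training. Note that a product of transition probabilities favors sequences with fewer segments; a real scoring function would likely need to account for that.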